Decoding Chaos: Initial Data Exploratory Analysis (IDEA)
Author
Teo Suan Ern
Published
February 27, 2024
Modified
March 1, 2024
1. Overview
1.1 Project Brief
Take-home Exercise 4 is the preliminary work for the final group project. Armed conflicts arising from political violence and coordinated attacks targeting innocent civilians have been on the rise globally, threatening the public at both physical and psychological levels. A good visual analysis of armed conflicts is essential to help (1) discover armed conflict trends and (2) conceptualise armed conflict spaces.
The project team consists of three members, and each member will take one of the main prototype modules as follows:
Exploratory Data Analysis (Initial & Geospatial)
Spatial Point Pattern Analysis
Multivariate Clustering Analysis
1.2 Project Objectives
The project uses open-source data on armed conflict events from the Armed Conflict Location & Event Data Project (ACLED). The objective of my assignment is to build a prototype and user interface (UI) design for the Exploratory Data Analysis (Initial & Geospatial) module, providing easy-to-use and insightful visualisation tools that the Defence and Security sectors can use to develop effective countermeasures and strategies.
1.3 Exploratory Data Analysis
This project is separated into two segments:
Initial Data Exploratory Analysis (IDEA) – Current Page
The Exploratory Data Analysis is developed into two main portions.
Initial Data Exploratory Analysis allows users to select different variables and perform initial exploration on the dataset to discover distribution and trends of armed conflicts in Myanmar.
Geospatial Data Exploratory Analysis allows users to select different variables and perform spatial exploration on the dataset to conceptualise armed conflict spaces in Myanmar.
The initial proposed layouts and features of the three sections under Exploratory are as follows:
Overview
Geospatial Exploration
Data Analysis
3. Initial Data Preparation
3.1 Install and launch R packages
The project uses p_load() from the pacman package to check whether the required R packages are installed on the computer. Any missing packages are installed, and all of them are then loaded.
The following code chunk is used to install and launch the R packages.
pacman::p_load(tidyverse, kableExtra, leaflet, rmarkdown, knitr,
               highcharter,                            # timeseries highchart
               viridis, ggthemes, ggplot2, tidyr, dplyr, viridisLite, RColorBrewer,
               calendR,                                # calendar
               lubridate,                              # convert date from char to date format
               wordcloud, tidytext,                    # word cloud
               ggforce,                                # boxplot
               countrycode, sf, spdep, tmap, leaflet,  # geospatial
               tm, plotly)
data <- read.csv("data/1900-01-01-2024-02-26-Southeast_Asia-Myanmar.csv")
3.3 Overview of the data
The imported data consists of 55,574 observations and 35 variables, as reported by str() in the next step. Each row records one armed conflict event in Myanmar, covering its type, agents, location, date and other characteristics (such as whether it was political violence or a demonstration).
Dataset Structure
Use str() to check the structure of the data.
str(data)
'data.frame': 55574 obs. of 35 variables:
$ event_id_cnty : chr "MMR56099" "MMR56222" "MMR56370" "MMR56376" ...
$ event_date : chr "31-Dec-23" "31-Dec-23" "31-Dec-23" "31-Dec-23" ...
$ year : int 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ time_precision : int 1 1 1 1 1 1 1 1 1 1 ...
$ disorder_type : chr "Political violence" "Political violence" "Political violence" "Demonstrations" ...
$ event_type : chr "Explosions/Remote violence" "Explosions/Remote violence" "Battles" "Protests" ...
$ sub_event_type : chr "Shelling/artillery/missile attack" "Shelling/artillery/missile attack" "Armed clash" "Peaceful protest" ...
$ actor1 : chr "Military Forces of Myanmar (2021-)" "Military Forces of Myanmar (2021-)" "Phoenix DF: Phoenix Defense Force (Nattalin)" "Protesters (Myanmar)" ...
$ assoc_actor_1 : chr "" "" "" "" ...
$ inter1 : int 1 1 3 6 1 1 3 1 2 1 ...
$ actor2 : chr "" "Civilians (Myanmar)" "Military Forces of Myanmar (2021-)" "" ...
$ assoc_actor_2 : chr "" "" "" "" ...
$ inter2 : int 0 7 1 0 7 0 1 0 1 7 ...
$ interaction : int 10 17 13 60 17 10 13 10 12 17 ...
$ civilian_targeting: chr "" "Civilian targeting" "" "" ...
$ iso : int 104 104 104 104 104 104 104 104 104 104 ...
$ region : chr "Southeast Asia" "Southeast Asia" "Southeast Asia" "Southeast Asia" ...
$ country : chr "Myanmar" "Myanmar" "Myanmar" "Myanmar" ...
$ admin1 : chr "Mon" "Rakhine" "Bago-West" "Sagaing" ...
$ admin2 : chr "Mawlamyine" "Maungdaw" "Thayarwady" "Yinmarbin" ...
$ admin3 : chr "Ye" "Maungdaw" "Nattalin" "Salingyi" ...
$ location : chr "Aing Shey" "Kaing Gyi (NaTaLa)" "Kyauk Pyoke" "Let Pa Taung" ...
$ latitude : num 15.3 20.7 18.6 22.1 18.6 ...
$ longitude : num 98 92.4 95.8 95.1 95.8 ...
$ geo_precision : int 1 2 2 2 1 1 1 2 2 1 ...
$ source : chr "Democratic Voice of Burma" "Development Media Group; Narinjara News" "Khit Thit Media; Myanmar Pressphoto Agency" "Myanmar Labour News" ...
$ source_scale : chr "National" "Subnational" "National" "National" ...
$ notes : chr "On 31 December 2023, in Aing Shey village (Ye township, Mawlamyine district, Mon state), following a clash betw"| __truncated__ "On 31 December 2023, in Kaing Gyi (Mro) village (coded as Kaing Gyi (NaTaLa)) (Maungdaw township, Maungdaw dist"| __truncated__ "On 31 December 2023, near Kyauk Pyoke village (Nattalin township, Thayarwady district, Bago-West region), the P"| __truncated__ "On 31 December 2023, in the Let Pa Taung area of Salingyi township (Yinmarbin district, Sagaing region), protes"| __truncated__ ...
$ fatalities : int 0 0 4 0 0 0 3 0 0 0 ...
$ tags : chr "" "" "" "crowd size=no report" ...
$ timestamp : int 1704831212 1704831213 1704831214 1704831214 1704831214 1704831216 1704831216 1704831216 1704831216 1704831216 ...
$ population_1km : int NA NA NA 749 NA 178 6634 671 687 35292 ...
$ population_2km : int NA NA NA 521 NA 135 19078 2197 654 85732 ...
$ population_5km : int 3081 NA NA 1358 NA NA 34396 3144 656 169473 ...
$ population_best : int 3081 NA NA 749 NA NA 34396 3144 656 85732 ...
The output above reveals that event_date is in character format instead of date format.
Use colSums() to check for missing values
The output shows that the population variables contain missing values. A quick check of the dataset reveals that population data is only available from 2020 onwards, which explains the large number of missing values.
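The missing-value check described here can be sketched as follows. This is a minimal illustration on a small made-up data frame (the toy rows and the names `toy` and `na_counts` are assumptions, not the ACLED data), showing how colSums() combined with is.na() counts missing values per column.

```r
# Toy data frame standing in for the ACLED extract (values are made up)
toy <- data.frame(
  event_id_cnty   = c("A1", "A2", "A3"),
  fatalities      = c(0L, 4L, 1L),
  population_1km  = c(NA, 749L, NA),
  population_best = c(3081L, NA, 749L)
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain missing values
na_counts[na_counts > 0]
```

On the real dataset the same call would be `colSums(is.na(data))`.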
[1] event_id_cnty event_date year time_precision
[5] disorder_type event_type sub_event_type actor1
[9] assoc_actor_1 inter1 actor2 assoc_actor_2
[13] inter2 interaction civilian_targeting iso
[17] region country admin1 admin2
[21] admin3 location latitude longitude
[25] geo_precision source source_scale notes
[29] fatalities tags timestamp population_1km
[33] population_2km population_5km population_best
<0 rows> (or 0-length row.names)
4. Data Wrangling
The flowchart diagram below provides an overview of the key variables used in this project.
flowchart TD
A(Key Variables Used \n event_id_cnty)
A --> B(Time Period)
A --> C(Characteristic of Incident)
A --> D(Location)
B --> E(year)
B --> F(date)
B -.-> G(New Variables)
G -.-> H(day)
G -.-> I(week number)
G -.-> J(month)
C --> K(event_type)
C --> L(sub_event_type)
C --> M(actor1)
C --> N(actor2)
C --> O(fatalities)
C -.-> P(New Variables)
P -.-> Q(total incidents)
P -.-> R(total fatalities)
P -.-> S(political violence rate)
P -.-> T(violence against civilian rate)
P -.-> U(territory exchange rate \n-non-state exchange)
P -.-> V(territory exchange rate \n-government regains territory)
D --> W(country)
D --> X(longitude)
D --> Y(latitude)
D --> Z(admin1)
D --> AA(admin2)
D --> AB(admin3)
D --> AC(geometry points)
D -.-> AD(New Variables)
AD -.-> AE(shapeID)
4.1 Convert event_date format
The code chunk below uses dmy() from the lubridate package to convert event_date from character to date format:
data$event_date <- dmy(data$event_date)
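As a quick illustration of what the conversion does, the sketch below parses two strings in the same day-month-year layout as event_date (for example "31-Dec-23"). The vector names `raw_dates` and `parsed` are made up for this example, and the month abbreviations are assumed to be read in an English locale.

```r
library(lubridate)

# Character dates in the same "dd-Mon-yy" layout as the raw event_date column
raw_dates <- c("31-Dec-23", "01-Feb-21")

# dmy() reads day, then month, then year and returns a Date vector
parsed <- dmy(raw_dates)

class(parsed)   # now a Date rather than character
parsed
```

Once the column is a Date, helpers such as year() or isoweek() can be applied to it directly.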
4.2 Create new variables
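The code chunk below creates the following new variables, based on total armed conflict incidents and total fatalities (by disorder_type and sub_event_type): total_inci, total_fata, political_rate, civilian_rate, non_state_exchange and govt_regain_exchange.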
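The chunk that builds these variables is not shown here, so the following is only a hedged sketch of one way such yearly indicators could be derived with dplyr. The toy rows, the grouping by year, and the rate definitions (share of incidents per year, in percent) are all assumptions for illustration and may differ from the actual calculation.

```r
library(dplyr)

# Toy events (made up); column names follow the ACLED fields used in this project
toy <- data.frame(
  year           = c(2022, 2022, 2023, 2023, 2023, 2023),
  disorder_type  = c("Political violence", "Demonstrations",
                     rep("Political violence", 4)),
  event_type     = c("Battles", "Protests", "Violence against civilians",
                     "Battles", "Battles", "Battles"),
  sub_event_type = c("Armed clash", "Peaceful protest", "Attack", "Armed clash",
                     "Non-state actor overtakes territory",
                     "Government regains territory"),
  fatalities     = c(2, 0, 1, 4, 0, 0)
)

# Assumed definitions: each rate is the percentage of that year's incidents
yearly <- toy %>%
  group_by(year) %>%
  summarise(
    total_inci           = n(),
    total_fata           = sum(fatalities),
    political_rate       = 100 * mean(disorder_type == "Political violence"),
    civilian_rate        = 100 * mean(event_type == "Violence against civilians"),
    non_state_exchange   = 100 * mean(sub_event_type == "Non-state actor overtakes territory"),
    govt_regain_exchange = 100 * mean(sub_event_type == "Government regains territory"),
    .groups = "drop"
  )

# Attach the yearly indicators back onto every event row
toy_final <- left_join(toy, yearly, by = "year")
```

Joining the summary back to the event rows reproduces the pattern visible in the final dataset, where every event of a given year carries the same yearly totals.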
The code chunk below saves the dataset in .rds format for the subsequent geospatial EDA.
write_rds(final, "data/final.rds")
Use str() to check the structure of the final dataset.
str(final)
tibble [13,177 × 30] (S3: tbl_df/tbl/data.frame)
$ event_id_cnty : chr [1:13177] "MMR56370" "MMR56871" "MMR56878" "MMR56900" ...
$ event_date : Date[1:13177], format: "2023-12-31" "2023-12-31" ...
$ year : int [1:13177] 2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
$ disorder_type : chr [1:13177] "Political violence" "Political violence" "Political violence" "Political violence" ...
$ event_type : chr [1:13177] "Battles" "Battles" "Explosions/Remote violence" "Battles" ...
$ sub_event_type : chr [1:13177] "Armed clash" "Armed clash" "Air/drone strike" "Armed clash" ...
$ actor1 : chr [1:13177] "Phoenix DF: Phoenix Defense Force (Nattalin)" "MSRF: Mon State Revolutionary" "Military Forces of Myanmar (2021-)" "MSPDF: Myaung Special People's Defense Force" ...
$ inter1 : int [1:13177] 3 3 1 3 3 1 3 1 2 3 ...
$ actor2 : chr [1:13177] "Military Forces of Myanmar (2021-)" "Military Forces of Myanmar (2021-)" "Civilians (Myanmar)" "Military Forces of Myanmar (2021-)" ...
$ inter2 : int [1:13177] 1 1 7 1 1 7 1 7 1 1 ...
$ interaction : int [1:13177] 13 13 17 13 13 17 13 17 12 13 ...
$ civilian_targeting : chr [1:13177] "" "" "Civilian targeting" "" ...
$ iso : int [1:13177] 104 104 104 104 104 104 104 104 104 104 ...
$ region : chr [1:13177] "Southeast Asia" "Southeast Asia" "Southeast Asia" "Southeast Asia" ...
$ country : chr [1:13177] "Myanmar" "Myanmar" "Myanmar" "Myanmar" ...
$ admin1 : chr [1:13177] "Bago-West" "Mon" "Sagaing" "Sagaing" ...
$ admin2 : chr [1:13177] "Thayarwady" "Mawlamyine" "Katha" "Monywa" ...
$ admin3 : chr [1:13177] "Nattalin" "Ye" "Tigyaing" "Chaung-U" ...
$ location : chr [1:13177] "Kyauk Pyoke" "Kyaung Ywar" "Kan Pauk" "Chaung-U" ...
$ latitude : num [1:13177] 18.6 15.3 23.9 22 21.3 ...
$ longitude : num [1:13177] 95.8 98 96.1 95.3 95.4 ...
$ source : chr [1:13177] "Khit Thit Media; Myanmar Pressphoto Agency" "Democratic Voice of Burma; Khit Thit Media; Myanmar Pressphoto Agency" "Democratic Voice of Burma; Khit Thit Media; Myanmar Pressphoto Agency; Radio Free Asia" "Khit Thit Media; Myanmar Pressphoto Agency" ...
$ notes : chr [1:13177] "On 31 December 2023, near Kyauk Pyoke village (Nattalin township, Thayarwady district, Bago-West region), the P"| __truncated__ "On 31 December 2023, in Kyaung Ywar village (Ye township, Mawlamyine district, Mon state), a combined force of "| __truncated__ "On 31 December 2023, in Kan Pauk village (Tigyaing township, Katha district, Sagaing region), the Myanmar milit"| __truncated__ "On 31 December 2023, in Chaung-U town (Chaung-U township, Monywa district, Sagaing region), a combined force of"| __truncated__ ...
$ fatalities : int [1:13177] 4 3 1 1 2 1 3 1 1 4 ...
$ total_fata : int [1:13177] 15716 15716 15716 15716 15716 15716 15716 15716 15716 15716 ...
$ total_inci : int [1:13177] 4054 4054 4054 4054 4054 4054 4054 4054 4054 4054 ...
$ political_rate : num [1:13177] 99.5 99.5 99.5 99.5 99.5 ...
$ civilian_rate : num [1:13177] 23.3 23.3 23.3 23.3 23.3 ...
$ non_state_exchange : num [1:13177] 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 0.02 ...
$ govt_regain_exchange: num [1:13177] 0 0 0 0 0 0 0 0 0 0 ...
5. Initial Exploratory Data Analysis
5.1 Descriptive Statistics
Before proceeding with data visualisation, it is essential to be able to navigate the dataset of 13,177 observations and 30 variables with ease. This segment helps users find and browse observations instead of scrolling through them one by one. The interactive data table is created using the DT package.
Design Features - Interactive Data Table
Display a chosen number of observations via the dropdown (5, 10, 25, 50 or 100 entries). This ensures that the observations do not span the entire webpage.
View other pages of observations with the “previous” or “next” button.
Search for specific observations with the search bar, which matches a string or numerical value in any column of an observation.
Filter observations with the filter bar directly below the column headers.
Column visibility allows users to select the columns they are interested in viewing and hide the rest.
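The design features listed above map fairly directly onto options of DT::datatable(). The sketch below is an illustration on a small made-up data frame, not the project's actual chunk. The object names `toy` and `tbl` are assumptions.

```r
library(DT)

# Toy rows standing in for the final dataset (values are made up)
toy <- data.frame(
  year       = c(2021, 2022, 2023),
  event_type = c("Battles", "Protests", "Battles"),
  fatalities = c(4, 0, 1)
)

tbl <- datatable(
  toy,
  filter     = "top",                        # filter bar below the column headers
  extensions = "Buttons",                    # needed for the column visibility button
  options    = list(
    pageLength = 5,                          # rows shown per page by default
    lengthMenu = c(5, 10, 25, 50, 100),      # entries offered in the dropdown
    dom        = "Blfrtip",                  # B = buttons, l = length menu, f = search box
    buttons    = list("colvis")              # column visibility toggle
  )
)

tbl
```

Paging with "previous" and "next" and the global search bar are part of the default DataTables layout, so they need no extra options.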
Distribution of armed conflicts and fatalities over the years in Myanmar
Design Features
The prototype proposes to include the following interactive elements for users’ data exploration:
Dropdown filters such as event_type and actor1
Radio button selection on total armed conflicts or total fatalities
Slider bar to select the years
Checkbox selection to filter/select by sub-national administrative region 1
forcats::fct_infreq() reorders the factor levels of admin1 by frequency, so the administrative regions are ordered by their number of events when visualised over the time period.
geom_boxplot(), used together with ggplotly(), provides statistical information such as the minimum, maximum, median, and first- and third-quartile values on hover.
geom_sina() is useful for plotting a single variable in a multiclass dataset to show the density distribution within each class.
box1 <- ggplot(final,
               aes(x = forcats::fct_infreq(admin1), y = event_date,
                   color = factor(admin1), fill = factor(admin1))) +
  # sina points show the density of events over time within each region
  geom_sina(method = "density", alpha = .3) +
  # boxplot drawn beside the points, with outlier markers hidden
  geom_boxplot(width = .2, color = "#000000", fill = NA, size = .5,
               outlier.shape = NA, position = position_nudge(.25)) +
  coord_flip() +
  theme(legend.position = "none", plot.title.position = "plot") +
  labs(title = "Frequency of Conflict Has Increased Over Time in Most Administrative Regions",
       subtitle = "Year 2010 to Year 2023") +
  labs(y = "Year (2010-2023)", x = "Administrative Region 1",
       caption = "Data Source: ACLED (2023)")
ggplotly(box1)
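To make the role of fct_infreq() in the plot concrete, the sketch below applies it to a tiny made-up factor of region names. The vector names `admin1_toy` and `ordered_admin1` are hypothetical.

```r
library(forcats)

# Toy region labels: Sagaing appears 3 times, Rakhine twice, Mon once
admin1_toy <- factor(c("Mon", "Sagaing", "Sagaing", "Rakhine", "Sagaing", "Rakhine"))

# Default factor levels are sorted alphabetically
levels(admin1_toy)

# fct_infreq() reorders the levels from most to least frequent
ordered_admin1 <- fct_infreq(admin1_toy)
levels(ordered_admin1)
```

Only the level order changes, not the data, so the regions on the axis follow event counts rather than the alphabet.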
5.3 Timeseries Analysis
Trend of armed conflicts and fatalities in Myanmar
Design Features
The prototype proposes to include the following interactive elements for users’ data exploration:
Dropdown filters such as event_type and actor1
Radio button selection on total armed conflicts or total fatalities
Slider bar to select the years
Checkbox selection to filter/select by sub-national administrative region 1
Information such as the year and the counts of armed conflicts and fatalities is shown on hover.
# Yearly totals, counting only incidents with at least one fatality
year_fata <- final %>%
  filter(fatalities > 0) %>%
  group_by(year) %>%
  select(year, fatalities) %>%
  summarise(total_fata = sum(fatalities),
            total_inci = n()) %>%
  ungroup()

hc_plot1 <- highchart() %>%
  hc_add_series(year_fata, hcaes(x = year, y = total_fata), type = "line",
                name = "Total Fatalities", color = "lightcoral") %>%
  hc_add_series(year_fata, hcaes(x = year, y = total_inci), type = "line",
                name = "Total Incidents", color = "black") %>%
  hc_tooltip(crosshairs = TRUE, headerFormat = "",
             backgroundColor = "#FCFFC5", borderWidth = 5,
             pointFormat = "Year: <b>{point.year}</b> <br> Fatalities: <b>{point.total_fata}</b> <br> Incidents: <b>{point.total_inci}</b>") %>%
  hc_title(text = "Armed Conflict Over The Years") %>%
  hc_subtitle(text = "2010 to 2023") %>%
  hc_xAxis(title = list(text = "Year")) %>%
  hc_yAxis(title = list(text = "Frequency"),
           allowDecimals = FALSE,
           # dashed reference line at the mean yearly fatalities
           plotLines = list(list(
             color = "lightcoral", width = 1, dashStyle = "Dash",
             value = mean(year_fata$total_fata),
             label = list(text = paste("Average fatalities:",
                                       round(mean(year_fata$total_fata))),
                          style = list(color = "lightcoral", fontSize = 20))))) %>%
  hc_add_theme(hc_theme_flat())
hc_plot1
Calendar visualisation of armed conflicts and fatalities
Design Features
The prototype proposes to include the following interactive elements for users’ data exploration:
Dropdown filter for year
geom_boxplot(), used together with ggplotly(), provides statistical information such as the minimum, maximum, median, and first- and third-quartile values on hover.
geom_sina() is useful for plotting a single variable in a multiclass dataset to show the density distribution within each class.
The code chunk below derives new variables by using weekdays(), mday(), months() and isoweek().
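The derivation described above could look roughly like the following. This is a minimal sketch on three made-up dates. The data frame `events` and the new column names (day_of_week, day_of_month, month_name, week_number) are assumptions, and the text returned by weekdays() and months() depends on the system locale.

```r
library(lubridate)

# Toy event dates (made up) standing in for final$event_date
events <- data.frame(
  event_date = as.Date(c("2023-12-31", "2024-01-01", "2024-02-26"))
)

events$day_of_week  <- weekdays(events$event_date)  # base R, locale-dependent name
events$day_of_month <- mday(events$event_date)      # lubridate, 1 to 31
events$month_name   <- months(events$event_date)    # base R, locale-dependent name
events$week_number  <- isoweek(events$event_date)   # lubridate, ISO 8601 week

events
```

Note that ISO weeks run Monday to Sunday, so a date at the very end of December can belong to the last week of its own year or to week 1 of the next. That matters when laying events out on a calendar grid.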